Abstract

We investigate the generalizability of learned 
binary relations: functions that map pairs of 
instances to a logical indicator. This problem 
has application in numerous areas of machine 
learning, such as ranking, entity resolution 
and link prediction. Our learning framework 
incorporates an example labeler that, given a 
sequence X of n instances and a desired training
size m, subsamples m pairs from X × X
without replacement. The challenge in ana- 
lyzing this learning scenario is that pairwise 
combinations of random variables are inher- 
ently dependent, which prevents us from us- 
ing traditional learning-theoretic arguments. 
We present a unified, graph-based analysis, 
which allows us to analyze this dependence 
using well-known graph identities. We are 
then able to bound the generalization error 
of learned binary relations using Rademacher 
complexity and algorithmic stability. The 
rate of uniform convergence is partially deter- 
mined by the labeler's subsampling process. 
We thus examine how various assumptions 
about subsampling affect generalization; un- 
der a natural random subsampling process, 
our bounds guarantee $O(1/\sqrt{n})$ uniform convergence.



1 Introduction 

We investigate the generalizability of a learned binary
relation: a characteristic function $r : \mathcal{X}^2 \to \{\pm 1\}$
that indicates whether the input pair satisfies a relation.
For example, if $r$ is an equivalence relation, then
$r(x, x') = 1$ indicates that $x = x'$; if $r$ is a total
ordering, then $r(x, x') = 1$ means that $x \le x'$. Binary
relations are found in many learning problems, such 
as ranking (total ordering), entity resolution (equiva- 
lence) and link prediction. There has been significant 
research in each of these fields individually, yet no uni- 
fied view of the learning problem. We formulate the 
learning objective (in Section 2) as inductive inference
on a product space $\mathcal{X}^2$, where instances are drawn
from an arbitrary distribution over $\mathcal{X}$. Given a set of
$n$ independently and identically distributed (i.i.d.) instances
$X \in \mathcal{X}^n$, as well as a subset of $X^2$ that has
been labeled by $r$, our goal is to produce a hypothesis
$h : \mathcal{X}^2 \to \{\pm 1\}$ that, with high probability, has low
error w.r.t. $r$.

The primary challenge in analyzing this learning setup 
is reasoning about the dependency structure of a pair- 
wise training set. Even though the instances are i.i.d., 
the training examples may not be; since each instance 
may appear in multiple examples, those that involve 
a common instance are necessarily dependent. This 
dependence makes the analysis of generalization non- 
trivial, since it violates the fundamental assumption of 
independence used in classical statistics and learning 
theory, rendering existing results incompatible. When 
the training set does not simply contain all pairs of 
instances, its dependency structure is difficult to ana- 
lyze. 

In Section 3, we introduce graphical representations 
of the training set and its dependency structure. The 
training set is viewed as a graph G, in which each 
instance is a vertex and each pairwise example is an 
(undirected) edge. The dependency structure is represented
by the corresponding line graph $G_d$, since edges
that share a common vertex (instance) are adjacent in 
Gd. Casting the dependency structure as a graph al- 
lows us to leverage the vast literature of graph theory, 
which makes our analysis cleaner. The "amount of
dependence" can be quantified by the chromatic number
of $G_d$, since the chromatic number is the minimal number
of independent sets into which the vertices can be
partitioned. Using well-known
chromatic properties, we are able to upper bound this
quantity. We then use graph-based arguments in Sec- 
tion 4 to bound the generalization error of learned 
binary relations, using Rademacher complexity or al- 
gorithmic stability. We show that the empirical error 
converges to the expected error at a rate of $O(\sqrt{p/m})$,
where m is the number of labeled training pairs and p 
is the maximum frequency of any instance in this set. 
This ratio depends on how pairs are subsampled (irre- 
spective of their values). Consequently, in Section 5, 
we explore several learning scenarios in which the sub- 
sampling process is working for, against, or is agnostic 
to the learner (i.e., random). Using a reduction to a 
random graph model, we prove interesting results for 
the agnostic scenario. We thus provide a novel analysis 
of the relationship between subsampling and the rate 
of uniform convergence for learned binary relations. 

1.1 Related Work 

Goldman et al. [1993] were the first to address learn- 
ing a binary relation. Their learning setting is entirely 
different from ours, in that there is no underlying dis- 
tribution from which instances are generated, and they 
are not interested in generalization to unseen data. In 
their setting, the problem size is polynomial in a finite 
number of objects. Our learning setting subsumes this 
one. 

Our main analytic tools come from two areas of learn- 
ing theory: Rademacher complexity and algorithmic 
stability. The canonical Rademacher bounds are due 
to Koltchinskii and Panchenko [2002] and Bartlett and 
Mendelson [2003], while the canonical stability bounds
are due to Bousquet and Elisseeff [2002]. Though all of
these techniques fundamentally rely on independence, 
there have been recent applications to non-i.i.d. prob- 
lems. Inspired by Janson [2004], Usunier et al. [2006] 
developed a theory of chromatic Rademacher analysis, 
using it to derive generic risk bounds for dependent 
data, with bipartite ranking as a motivating exam- 
ple. (Ralaivola et al. [2010] used a similar chromatic 
analysis to derive PAC-Bayes bounds.) Our analysis 
draws on this work, though we delve deeper into the 
setting of pairwise prediction. More recently, Mohri
and Rostamizadeh [2009, 2010] used an "independent
blocking" technique to derive both Rademacher- and
stability-based risk bounds for time series data drawn 
from a strongly-mixing stationary process, though this 
is an entirely different form of dependence. 

A number of authors [Bar-Hillel and Weinshall, 2003,
Clémençon et al., 2008, Jin et al., 2009] have developed
risk bounds for problems involving pairwise prediction
(or learning with pairwise constraints), though they
all assume that the training set contains all pairwise
combinations of the instances, ignoring the more
interesting problem of learning from a (sparse) subsample.
Our results are not only more general, but they offer 
insight into how the subsampling process affects the 
rate of convergence. Agarwal and Niyogi [2009] con- 
sider a random subsampling process similar to one we 
discuss and derive risk bounds for ranking using al- 
gorithmic stability. We provide an alternate analysis 
that is clean, natural and interpretable. 

2 Preliminaries 

This section introduces the necessary background concepts.
To clarify certain probabilities, we let $\Pr_\omega[\cdot]$
denote the probability of an event w.r.t. a random
draw of $\omega$ from an implicit sample space $\Omega$. Similarly,
let $\mathbb{E}_\omega[\cdot]$ denote the expectation of a random variable
w.r.t. a random draw of $\omega \in \Omega$, according to an
implicit probability distribution.

2.1 Problem Setting 

Let $\mathcal{X}$ denote an abstract instance space (e.g., $\mathcal{X}$ could
be a subset of Euclidean space). We are interested in
learning a binary relation $r : \mathcal{X}^2 \to \{\pm 1\}$. We will
limit our analysis to relations that are reflexive, and
either symmetric or antisymmetric, examples of which
include equivalence and total ordering.

We define the learning process as follows. We are given
a sequence of instances $X = (x_1, \dots, x_n) \in \mathcal{X}^n$, drawn
independently and identically from an arbitrary distribution
over $\mathcal{X}$. We are also given access to a labeler $S$,
a black-box process that returns a subset of
pairwise labeled examples. More precisely, given an
input $X$ and a training size $m$, $S_m(X)$ returns some
subset of $X^2$ (sampled without replacement) that has
been labeled according to $r$. In practice, this could be
a crowd-sourcing application or a targeted surveying
process. The subset returned by $S_m(X)$ is sampled according
to a process that is independent of the instance
values $X$. Restricting our theory to labelers that choose
which pairs to label independently of $X$ ensures that
the observed data is drawn from the original distribution;
otherwise, the labeler could introduce bias. Within this
restriction, the subsampling process remains unknown.
It may be adversarial or benign, aiming to either weaken
or improve generalization; it may also be agnostic,
selecting pairs according to some random process. We will
return to the topic of subsampling in Section 5.

Due to our assumption of (anti)symmetry, only one example
per pair is required, since the converse (or contrapositive)
can be inferred from context. This means
that $m$ is naturally upper bounded by $\binom{n}{2}$. Upon
receiving a labeled dataset $Z \leftarrow S_m(X)$, we invoke a
learning algorithm $\mathcal{A}$ to obtain an inductive hypothesis
$h$ from a class $\mathcal{H}$ of functions from $\mathcal{X}^2$ to $\{\pm 1\}$. We
let $h_Z$ (or, more explicitly, $h_Z \leftarrow \mathcal{A}(Z)$) denote the
hypothesis trained on the dataset $Z$.

2.2 Generalization 

For a given cost function $c$ with values in $\mathbb{R}_+$, define the loss
$\ell$ of a hypothesis $h$ w.r.t. a pair $z = (x, x')$ as
$\ell(h, z) = c(r(z), h(z))$. Though there are various cost functions,
for learning binary relations, we are most interested in
the so-called 0-1 loss, $\ell_1(h, z) = \mathbb{1}[r(z) \neq h(z)]$.

The quantity we are primarily concerned with is the
generalization error, or risk. This is the expected
loss of a learned hypothesis $h_Z$ w.r.t. a random pair
$z = (x, x')$, where $x$ and $x'$ are sampled independently
and identically from the underlying distribution over
$\mathcal{X}$. We denote this quantity by $R(h_Z) = \mathbb{E}_z[\ell(h_Z, z)]$.
In practice, we can estimate the true risk with the
training error, computed as the average loss over all
examples in the training set. We denote this by
$\hat{R}(h_Z) = \frac{1}{m} \sum_{z \in Z} \ell(h_Z, z)$.

Given an empirical risk estimate, we would like to
bound its deviation from the true risk. We will refer
to this quantity as the defect, which we denote by
$D(h_Z) = R(h_Z) - \hat{R}(h_Z)$. We will use the canonical
probably approximately correct (PAC) framework
[Valiant, 1984], in which we allow the defect to exceed
an arbitrary $\epsilon > 0$ with probability $\delta \in (0, 1)$. There
are a number of ways of analyzing this probability, of
which we will focus on hypothesis complexity and
algorithmic stability. Analyses of this nature typically
aim to bound

\[
\Pr\Big[ \sup_{h \in \mathcal{H}} D(h) \ge \epsilon \Big]
\quad \text{or} \quad
\Pr\Big[ \sup_{h_Z \leftarrow \mathcal{A}(Z)} D(h_Z) \ge \epsilon \Big].
\]

Solving for $\epsilon$, one obtains a probabilistic upper bound
on the true risk, parameterized by $\delta$.

3 Graphical Representation 

In this section, we analyze the learning problem and 
its inherent dependencies using a graphical represen- 
tation. 

3.1 Training Data 

Recall that the training set, returned by the labeler
$S$, is an arbitrary subset of $X^2$ that has been
labeled according to $r$. We can represent this as a
graph $G = (V, E)$, in which each vertex $v_i \in V$
corresponds to an instance $x_i$, and each training example
$(x_i, x_j, r(x_i, x_j)) \in Z$ defines an (un)directed edge
between $v_i$ and $v_j$ in $E$. Thus, the subsampling pattern
reduces to a graph on the instances $X$. Note that, if $r$ is
antisymmetric, then $G$ will be directed; however, since
only one example per pair is needed, the corresponding
undirected graph is simple (i.e., not a multigraph).
This graph representation simplifies the analysis of
generalization and allows the labeler's subsampling to
be viewed as one of many well-studied graph generation
processes.

For example, we can consider an agnostic, random labeler
that selects pairs uniformly at random. Using
the above graphical representation, this subsampling
process can be modeled by the Erdős–Rényi random
graph model $G(n, m)$. In this model, a graph with $n$
vertices and $m$ edges is chosen uniformly from the set
of all such graphs. (This model differs slightly from the
popular $G(n, p)$ model, in which each edge is realized
independently with probability $p$.) We analyze
generalization using various labeler scenarios, including the
$G(n, m)$ labeler, in Section 5.
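To make the random labeler concrete, the following Python sketch draws $m$ distinct unordered pairs uniformly at random from $n$ instance indices, which is one way to realize the $G(n, m)$ model, and then computes the maximum instance frequency (the maximum degree of $G$). The function names `sample_gnm_pairs` and `max_instance_frequency`, and the choice to enumerate all pairs explicitly, are illustrative assumptions rather than part of the analysis; for large $n$ one would sample without materializing all $\binom{n}{2}$ pairs.

```python
import itertools
import random
from collections import Counter


def sample_gnm_pairs(n, m, seed=None):
    """Draw m distinct unordered index pairs uniformly at random (G(n, m))."""
    all_pairs = list(itertools.combinations(range(n), 2))
    if m > len(all_pairs):
        raise ValueError("m cannot exceed n choose 2")
    rng = random.Random(seed)
    # Sampling without replacement: every m-subset of pairs is equally likely.
    return rng.sample(all_pairs, m)


def max_instance_frequency(pairs):
    """Largest number of sampled pairs in which any single instance appears."""
    counts = Counter(i for pair in pairs for i in pair)
    return max(counts.values())


pairs = sample_gnm_pairs(n=50, m=100, seed=0)
p = max_instance_frequency(pairs)
```

Because the pairs are chosen from indices alone, the sketch respects the requirement that the labeler's choice of pairs is independent of the instance values.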

3.2 Dependency Structure 

For each instance $x_i$, define a random variable $X_i$,
and recall that these are i.i.d. For each example
pair $(x_i, x_j)$ found in $Z$, define a random variable
$Z_{i,j} = (X_i, X_j)$. Because each instance may appear
in multiple pairs, these random variables
are not mutually independent. To make this more
concrete, consider the set $\{Z_{1,2}, Z_{2,3}, Z_{3,4}\}$. Clearly,
$Z_{1,2}$ and $Z_{2,3}$ are dependent, since they include a
common variable, $X_2$. Now consider variables $Z_{1,2}$
and $Z_{3,4}$, and note that $\{1, 2\} \cap \{3, 4\} = \emptyset$.
Observing $(X_1, X_2)$ reveals nothing about $(X_3, X_4)$; thus,
$P(Z_{3,4} \mid Z_{1,2}) = P(Z_{3,4})$, and vice versa, so they are
independent.

We will represent the dependency structure using a
graphical representation due to Erdős and Lovász
[1975], known as a dependency graph.

Definition 1 (Dependency Graph). Let $Z = \{Z_i\}_{i=1}^n$
be a set of random variables with joint distribution
$P(Z)$, and let $G_d = (V, E)$ be a graph, with $V = [n]$.
Then $G_d$ forms a dependency graph w.r.t. $Z$ if every
independent set (i.e., set of non-adjacent vertices)
$I \subseteq V$ satisfies $P(\{Z_i\}_{i \in I}) = \prod_{i \in I} P(Z_i)$.

In the context of learning binary relations, we construct
a dependency graph $G_d$ in which each vertex
$v_{i,j}$ is connected to vertices $\{v_{k,\ell} : (k = i) \lor (\ell = j)\}$,
i.e., the set of vertices that involve instances $x_i$ or $x_j$.
As a result, any independent set of vertices (in the
graph-theoretic sense) is a set of independent random
variables. To better understand the structure of this
dependency graph, it helps to recall the graph $G$ defined
in the previous section; $G_d$ is in fact its corresponding
line graph, a graph representing the adjacencies
between edges.^1 Therefore, every independent set
in $G_d$ is a matching in $G$.
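The correspondence between independent sets in the line graph and matchings in $G$ can be checked on the three-example set $\{Z_{1,2}, Z_{2,3}, Z_{3,4}\}$ used above. The Python sketch below is illustrative only; the helper names `line_graph`, `is_independent` and `is_matching` are assumptions introduced here, and the graph is assumed simple (no repeated pairs).

```python
import itertools


def line_graph(edges):
    """Line graph adjacency: two edges are adjacent iff they share an endpoint."""
    adj = {e: set() for e in edges}
    for e, f in itertools.combinations(edges, 2):
        if set(e) & set(f):
            adj[e].add(f)
            adj[f].add(e)
    return adj


def is_independent(adj, subset):
    """True if no two elements of subset are adjacent in the line graph."""
    return all(f not in adj[e] for e, f in itertools.combinations(subset, 2))


def is_matching(subset):
    """True if no instance (vertex) appears in more than one edge of subset."""
    endpoints = [v for e in subset for v in e]
    return len(endpoints) == len(set(endpoints))


edges = [(1, 2), (2, 3), (3, 4)]
gd = line_graph(edges)
```

Here `(1, 2)` and `(3, 4)` are non-adjacent in `gd`, matching the observation that $Z_{1,2}$ and $Z_{3,4}$ share no instance.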

3.3 Chromatic Properties 

In this section, we discuss the chromatic properties of
the above graphical representations, which will introduce
a key lemma used in our generalization bounds.
For completeness, we first review some background
on graph coloring. For the following definitions, let
$G = (V, E)$ be an arbitrary undirected graph.

Definition 2 (Vertex Coloring). A proper k-vertex-coloring
(often referred to as simply a k-coloring)
is a mapping from $V$ to a set of $k$ color
classes, such that no two adjacent vertices have the
same color. Equivalently, it is a partitioning
$\{C_j : C_j \subseteq V\}_{j=1}^k$, such that $\bigcup_{j=1}^k C_j = V$,
$\bigcap_{j=1}^k C_j = \emptyset$, and every subset $C_j$ is independent.
The chromatic number $\chi(G)$ is the minimum number of
colors needed to achieve a proper coloring.

Definition 3 (Edge Coloring). A proper k-edge-coloring
is a mapping from $E$ to a set of $k$ color classes,
such that no two coincident edges have the same color.
The chromatic index $\chi'(G)$ is the minimum number of
colors needed to achieve a proper edge coloring.

Theorem 1 (Vizing, 1964). If $G$ has maximum degree
$\Delta(G)$, then $\Delta(G) \le \chi'(G) \le \Delta(G) + 1$.

For a dependency graph, the chromatic number can be
viewed roughly as the "amount of dependence". The
lower the chromatic number, the more independence.
If the variables are i.i.d., then the chromatic number is
1. Coloring the vertices of the dependency graph $G_d$
described in Section 3.2 can be reduced to coloring the
edges of the graph $G$ described in Section 3.1, since one
is simply the line graph of the other. The chromatic
index of $G$ is equal to the chromatic number of $G_d$,
so bounding one quantity bounds the other. Although
there are many graphs for which $\chi'(G) = \Delta(G)$,
determining the chromatic index of an arbitrary graph
is NP-hard, so we will rely on the upper bound. We
can therefore state the amount of dependence in terms
of the chromatic index of $G$. Note that the maximum
degree $\Delta(G)$ is simply the maximum frequency of any
instance in the training set, which yields the following
lemma.

Lemma 1. Let $X \in \mathcal{X}^n$ be a set of $n$ i.i.d. instances,
and $Z \leftarrow S_m(X)$ an arbitrary training set of size $m$,
with maximum instance frequency $p$. Let $G_d$ be the
corresponding dependency graph of $Z$. Then, $\chi(G_d) \le p + 1$.
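As an illustration of how a training set can be partitioned into independent sets, the Python sketch below edge-colors $G$ greedily, giving each pair the smallest color not already used at either of its instances. This is a simple stand-in and not the optimal coloring the lemma refers to: a greedy coloring is only guaranteed to use at most $2p - 1$ colors, whereas Vizing's theorem guarantees that $p + 1$ suffice. The function name `greedy_edge_coloring` and the example graph are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def greedy_edge_coloring(edges):
    """Assign each edge the smallest color unused at both of its endpoints."""
    used = defaultdict(set)  # vertex -> colors already used on incident edges
    coloring = {}
    for u, v in edges:
        c = 0
        while c in used[u] or c in used[v]:
            c += 1
        coloring[(u, v)] = c
        used[u].add(c)
        used[v].add(c)
    return coloring


edges = [(0, 1), (0, 2), (0, 3), (1, 2), (2, 3), (3, 4)]
coloring = greedy_edge_coloring(edges)

# Group edges by color; each group is a matching, i.e., a set of
# mutually independent examples.
classes = defaultdict(list)
for edge, color in coloring.items():
    classes[color].append(edge)

p = max(Counter(v for e in edges for v in e).values())
num_colors = len(classes)
```

Each color class can then be treated as an i.i.d. block, which is how the chromatic arguments in the next section are applied.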



4 Generalization Bounds 

In this section, we develop risk bounds for learning bi- 
nary relations using both the Rademacher complexity 
and algorithmic stability. 

4.1 Concentration Inequalities 

A key component in any generalization analysis is the
concentration of random variables. We now present
two tail bounds that will be used in our proofs.

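Theorem 2 (McDiarmid, 1989). Let $Z = \{Z_i\}_{i=1}^n$ be
a set of i.i.d. random variables that take values in $\mathcal{Z}$.
Let $f : \mathcal{Z}^n \to \mathbb{R}$ be a measurable function for which
there exist constants $\{\alpha_i\}_{i=1}^n$ such that, for any $i \in [n]$,
and any inputs $Z, Z' \in \mathcal{Z}^n$ that differ only in the $i$-th
variable, $|f(Z) - f(Z')| \le \alpha_i$. Then, for any $\epsilon > 0$,

\[
\Pr\big[ f(Z) - \mathbb{E}[f(Z)] \ge \epsilon \big]
\le \exp\left( \frac{-2\epsilon^2}{\sum_{i=1}^n \alpha_i^2} \right).
\]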



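Theorem 3 (Usunier et al., 2006). Let $G_d$ be a dependency
graph (Definition 1) for a set of random
variables $Z = \{Z_i\}_{i=1}^n$ that take values in $\mathcal{Z}$. Let
$\{Z_j\}_{j=1}^{\chi(G_d)}$ denote the subsets induced by an optimal
proper coloring of $G_d$, and let $n_j = |Z_j|$. Finally,
let $f : \mathcal{Z}^n \to \mathbb{R}$ be a measurable function where: (a)
there exist functions $\{f_j : \mathcal{Z}^{n_j} \to \mathbb{R}\}_{j=1}^{\chi(G_d)}$ such that
$f(Z) = \sum_{j=1}^{\chi(G_d)} f_j(Z_j)$; (b) there exists a constant $\alpha$ such
that every $f_j$ is $\alpha$-Lipschitz w.r.t. the Hamming metric.
Then, for any $\epsilon > 0$,

\[
\Pr\big[ f(Z) - \mathbb{E}[f(Z)] \ge \epsilon \big]
\le \exp\left( \frac{-2\epsilon^2}{\chi(G_d)\, n\, \alpha^2} \right).
\]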

4.2 Rademacher Complexity 

Informally, the Rademacher complexity measures a
hypothesis class's expressive power, quantified by its
ability to fit a random signal. We slightly adapt the
traditional definition to better suit our learning context.

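Definition 4 (Rademacher Complexity). Let $\mathcal{X}$ be an
instance space, and $\mathcal{Z}$ some alternate space. Let
$\varphi : \mathcal{X}^n \to \mathcal{Z}^m$ denote a mapping from $n$ instances from $\mathcal{X}$
to $m$ instances from $\mathcal{Z}$. Let $\sigma$ be a set of Rademacher
variables that independently take values in $\{\pm 1\}$ with
equal probability. For a class $\mathcal{F}$ of functions from $\mathcal{Z}$
to $\mathbb{R}$, define the empirical Rademacher complexity of
$\mathcal{F} \circ \varphi$, w.r.t. instances $X \in \mathcal{X}^n$, as

\[
\hat{\mathfrak{R}}_X(\mathcal{F} \circ \varphi)
= \mathbb{E}_\sigma\left[ \frac{2}{m} \sup_{f \in \mathcal{F}}
  \sum_{z_i \in \varphi(X)} \sigma_i f(z_i) \right].
\]

Define the Rademacher complexity of $\mathcal{F} \circ \varphi$, w.r.t. i.i.d.
samples of size $n$, as
$\mathfrak{R}_n(\mathcal{F} \circ \varphi) = \mathbb{E}_{X \in \mathcal{X}^n}[\hat{\mathfrak{R}}_X(\mathcal{F} \circ \varphi)]$.

^1 If $G$ is directed, then $G_d$ is the line graph of its
corresponding undirected graph.
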
In our scenario, the data may exhibit dependencies;
yet, as we will see, this does not affect the traditional 
symmetry argument used in Rademacher analysis. 

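Theorem 4. Let $X \in \mathcal{X}^n$ be an i.i.d. sample of $n$
instances, and $Z \leftarrow S_m(X)$ an arbitrary training set
of size $m$, with maximum instance frequency $p$. Let $\mathcal{F}$
be a class of functions from $\mathcal{Z} = \mathcal{X}^2$ to $[0, 1]$. Then, for
any $n \ge 2$, any $m \ge 1$, any $f \in \mathcal{F}$, and any $\delta \in (0, 1)$,
with probability at least $1 - \delta$ over draws of $X$,

\[
\mathbb{E}_z[f(z)] \le \frac{1}{m} \sum_{z \in Z} f(z)
+ \mathfrak{R}_n(\mathcal{F} \circ S_m)
+ \sqrt{\frac{p + 1}{2m} \ln\frac{1}{\delta}}.
\tag{1}
\]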



Proof. With $G_d$ as the dependency graph of $Z$, we
invoke an optimal proper coloring, which partitions
$Z$ into $\chi(G_d) \le p + 1$ (via Lemma 1) independent
sets. We can consequently express the defect
$D(f) = \mathbb{E}_z[f(z)] - \frac{1}{m} \sum_{z \in Z} f(z)$ as a sum of
$\chi(G_d)$ functions of independent random variables. It is
easy to show that each of these functions is
$(1/m)$-Lipschitz w.r.t. the Hamming metric. We therefore
apply Theorem 3 and obtain

\[
\Pr_X\Big[ \sup_{f \in \mathcal{F}} D(f)
  - \mathbb{E}_X\big[ \sup_{f \in \mathcal{F}} D(f) \big] \ge \epsilon \Big]
\le \exp\left( \frac{-2\epsilon^2 m}{p + 1} \right).
\tag{2}
\]

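To bound $\mathbb{E}_X[\sup_{f \in \mathcal{F}} D(f)]$, we start by imagining a
"ghost sample" $X'$ of $n$ i.i.d. instances that have been
labeled using the same pattern as $Z$ to create $Z'$. Note
that $\mathbb{E}_z[f(z)] = \mathbb{E}_{X'}\big[\frac{1}{m} \sum_{z' \in Z'} f(z')\big]$ via linearity of
expectation, since all $z'$ have the same marginal
distribution. Using the ghost sample and Jensen's
inequality, we have that

\[
\begin{aligned}
\mathbb{E}_X\Big[ \sup_{f \in \mathcal{F}} D(f) \Big]
&= \mathbb{E}_X\Big[ \sup_{f \in \mathcal{F}} \mathbb{E}_z[f(z)]
   - \frac{1}{m} \sum_{z \in Z} f(z) \Big] \\
&= \mathbb{E}_X\Big[ \sup_{f \in \mathcal{F}} \mathbb{E}_{X'}\Big[
   \frac{1}{m} \sum_{z' \in Z'} f(z') \Big]
   - \frac{1}{m} \sum_{z \in Z} f(z) \Big] \\
&\le \mathbb{E}_{X, X'}\Big[ \sup_{f \in \mathcal{F}}
   \frac{1}{m} \sum_{i=1}^m \big( f(z'_i) - f(z_i) \big) \Big].
\end{aligned}
\tag{3}
\]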



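Now define a set of Rademacher variables $\sigma$, and define
random variables $\{Z_i\}_{i=1}^m$ and $\{Z'_i\}_{i=1}^m$ for the $Z$ and
$Z'$ sets respectively. Because they are labeled using
the same pattern, and have isomorphic dependency
graphs, we have that $P(Z_1, \dots, Z_m) = P(Z'_1, \dots, Z'_m)$.
In fact, if we exchange any $Z_i$ and $Z'_i$, we have that
$P(Z_1, \dots, Z'_i, \dots, Z_m) = P(Z'_1, \dots, Z_i, \dots, Z'_m)$. Thus,
since every draw of $\sigma$ occurs with equal probability, we
have via symmetry that

\[
\begin{aligned}
\text{Eq. (3)}
&\le \mathbb{E}_{X, X', \sigma}\Big[ \sup_{f \in \mathcal{F}}
   \frac{1}{m} \sum_{i=1}^m \sigma_i \big( f(z'_i) - f(z_i) \big) \Big] \\
&\le \mathbb{E}_{X, \sigma}\Big[ \frac{2}{m} \sup_{f \in \mathcal{F}}
   \sum_{i=1}^m \sigma_i f(z_i) \Big] \\
&\le \mathfrak{R}_n(\mathcal{F} \circ S_m),
\end{aligned}
\]

and so $\mathbb{E}_X[\sup_{f \in \mathcal{F}} D(f)] \le \mathfrak{R}_n(\mathcal{F} \circ S_m)$. To obtain
Equation 1, we simply set the right-hand side of Equation 2
equal to $\delta$ and solve for $\epsilon$. ■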

It is possible to derive a similar risk bound for the
empirical Rademacher complexity by applying Theorem 3
to the difference $\mathfrak{R}_n(\mathcal{F} \circ S_m) - \hat{\mathfrak{R}}_X(\mathcal{F} \circ S_m)$. We
omit this proof to save space, since the remainder of
our work does not require the empirical Rademacher
complexity.

To make these bounds more functional, we will replace
$\mathfrak{R}_n$ with an empirically verifiable quantity. For
certain hypothesis classes, it is possible to show that
the Rademacher complexity is bounded by a function
of the model parameters. One such class of hypotheses
is reproducing kernel Hilbert spaces (RKHS), which
subsume the popular support vector machine (SVM)
[Cristianini and Shawe-Taylor, 2000]. Formally, for
some mapping $\phi : \mathcal{Z} \to \mathbb{H}$, where $\mathcal{Z}$ is an instance
space and $\mathbb{H}$ is a Hilbert space, endowed with an inner
product $\langle \cdot, \cdot \rangle$, a kernel $\kappa : \mathcal{Z}^2 \to \mathbb{R}$ is a function
such that, for all $(z, z') \in \mathcal{Z}^2$, $\kappa(z, z') = \langle \phi(z), \phi(z') \rangle$.
The only requirement is that the kernel's Gram matrix
$K : K_{i,j} = \kappa(z_i, z_j)$ be symmetric and positive
semidefinite. RKHS hypotheses are generally of the
form $h_{\kappa,Z}(z) = \sum_{i=1}^m \alpha_i \kappa(z_i, z)$ with real coefficients
$\alpha_i$, where $Z \in \mathcal{Z}^m$ is a
set of reference points from the problem domain (e.g.,
support vectors). We may reasonably assume that the
norm of the kernel mapping is uniformly bounded by a
constant $C$; i.e., the mapped data is contained within
a ball of radius $C$. We denote the class of kernel
functions by $\mathcal{H}_\kappa$. Borrowing results from Koltchinskii and
Panchenko [2002] and Bartlett and Mendelson [2003],
we are able to prove the following risk bounds for
kernel-based hypotheses with bounded kernels, in the
context of learning binary relations.

For the following, we use the ramp loss as a surrogate
for the 0-1 loss. For a given $\gamma > 0$, a real-valued
hypothesis $h : \mathcal{X}^2 \to \mathbb{R}$ and an example $z$,
define the ramp loss as
$\ell^\gamma(h, z) = \min\{\max\{0, 1 - r(z)h(z)/\gamma\}, 1\}$.
To differentiate this from the 0-1 loss
when dealing with risk metrics, we henceforth use a
superscript 1 or $\gamma$.
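The ramp loss is easy to state in code. The Python sketch below implements it alongside a 0-1 loss on the sign of a real-valued score, so that the domination of the 0-1 loss by the ramp loss can be checked numerically. Treating a score of exactly zero as an error in `zero_one_loss` is an assumption made here for illustration, as are the function names.

```python
def zero_one_loss(label, score):
    """0-1 loss of the sign prediction; a zero score is counted as an error."""
    return 1.0 if label * score <= 0 else 0.0


def ramp_loss(label, score, gamma):
    """Ramp loss min{max{0, 1 - label * score / gamma}, 1} for gamma > 0."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    return min(max(0.0, 1.0 - label * score / gamma), 1.0)


# The ramp loss is 0 beyond the margin, 1 for misclassified points,
# and interpolates linearly in between.
examples = [(+1, 2.0), (+1, 0.5), (+1, -3.0), (-1, 0.25)]
losses = [ramp_loss(y, s, gamma=1.0) for y, s in examples]
```

For every label, score and positive `gamma`, `ramp_loss` is at least `zero_one_loss`, which is the domination property the proofs rely on.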

Theorem 5. Let $X \in \mathcal{X}^n$, $Z \leftarrow S_m(X)$ and $p$ be as
defined in Theorem 4. Let $h_{\kappa,Z}$ be a RKHS hypothesis
trained on $Z$, such that $\sup_z \|\phi(z)\| \le C$. Then, for
any $n \ge 2$, any $m \ge 1$, any $\gamma > 0$, and any $\delta \in (0, 1)$,
with probability at least $1 - \delta$ over draws of $X$,

\[
R^1(h_{\kappa,Z}) \le \hat{R}^\gamma(h_{\kappa,Z})
+ \frac{4C}{\gamma \sqrt{m}}
+ \sqrt{\frac{p + 1}{2m} \ln\frac{1}{\delta}}.
\tag{4}
\]



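Proof. Note that $\ell^\gamma$ dominates $\ell^1$, and thus
$R^1(h_{\kappa,Z}) \le R^\gamma(h_{\kappa,Z})$. Since the ramp loss is bounded
by $[0, 1]$, we apply Theorem 4 and obtain

\[
R^\gamma(h_{\kappa,Z}) \le \hat{R}^\gamma(h_{\kappa,Z})
+ \mathfrak{R}_n(\ell^\gamma \circ \mathcal{H}_\kappa \circ S_m)
+ \sqrt{\frac{p + 1}{2m} \ln\frac{1}{\delta}}.
\]

To bound $\mathfrak{R}_n(\ell^\gamma \circ \mathcal{H}_\kappa \circ S_m)$, we use Talagrand's
contraction lemma [Ledoux and Talagrand, 1991] and
borrow a result from Bartlett and Mendelson [2003,
Lemma 22], from which we obtain

\[
\mathfrak{R}_n(\ell^\gamma \circ \mathcal{H}_\kappa \circ S_m)
\le \frac{2}{\gamma} \mathfrak{R}_n(\mathcal{H}_\kappa \circ S_m)
\le \frac{2}{\gamma} \cdot \frac{2\sqrt{\operatorname{tr}(K)}}{m}.
\]

Using Cauchy-Schwarz, we can bound the trace of the
kernel's Gram matrix as

\[
\operatorname{tr}(K) = \sum_{z \in Z} \langle \phi(z), \phi(z) \rangle
\le \sum_{z \in Z} \|\phi(z)\|^2 \le m C^2.
\]

We therefore have that
$\mathfrak{R}_n(\ell^\gamma \circ \mathcal{H}_\kappa \circ S_m) \le \frac{4C}{\gamma\sqrt{m}}$. ■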

Note that this analysis slightly improves upon that
of Usunier et al. [2006] in that we use the regular
Rademacher complexity instead of their so-called
fractional Rademacher complexity. Because of this,
our Rademacher term is $O(1/\sqrt{m})$, compared to
$O(\sqrt{p/m})$ using the fractional Rademacher complexity;
note that $1/\sqrt{m} \le \sqrt{p/m}$ for all $p \ge 1$.

4.3 Algorithmic Stability 

This section derives a different generalization bound
for learning pairwise relations using our previous graph
representations and algorithmic stability.

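Definition 5 (Uniform Stability). For a training set
$Z$, let $Z'$ be a duplicate of $Z$ with the $i$-th example
removed. A learning algorithm $\mathcal{A}$ has uniform stability
$\beta$ w.r.t. a loss function $\ell$ if, for any $Z \in \mathcal{Z}^m$, and
any $i \in [m]$, $\mathcal{A}$ returns hypotheses $h_Z \leftarrow \mathcal{A}(Z)$ and
$h_{Z'} \leftarrow \mathcal{A}(Z')$ such that

\[
\sup_{z \in \mathcal{Z}} \big| \ell(h_Z, z) - \ell(h_{Z'}, z) \big| \le \beta.
\]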

In other words, excluding any single example from
training will increase the loss, w.r.t. any test example
$z$, by at most $\beta$. Of course, $\beta$ must be a function of the
size of the training set; indeed, we will later show that
generalization is only possible when $\beta = O(1/m)$. To
highlight this dependence, we henceforth use the notation
$\beta_m$. Using this notion of stability, we now derive
alternate risk bounds for learning binary relations.



Lemma 2 (Bousquet and Elisseeff, 2002). Let $\mathcal{A}$ be
a learning algorithm with uniform stability $\beta$ w.r.t. a
loss function $\ell$, where $\ell$ is upper bounded by $M$. Then,
for any $i \in [m]$, and any training sets $Z, Z' \in \mathcal{Z}^m$ that
differ only in the value of the $i$-th example, $\mathcal{A}$ returns
hypotheses $h_Z \leftarrow \mathcal{A}(Z)$ and $h_{Z'} \leftarrow \mathcal{A}(Z')$ such that

\[
\big| D(h_Z) - D(h_{Z'}) \big| \le 4\beta_m + \frac{M}{m}.
\]



Theorem 6. Let $X \in \mathcal{X}^n$ be an i.i.d. sample of $n$
instances, and $Z \leftarrow S_m(X)$ an arbitrary training set
of size $m$, with maximum instance frequency $p$. Let $\mathcal{A}$
be a learning algorithm with uniform stability $\beta$ w.r.t.
a loss function $\ell$ (upper bounded by $M$), and let
$h_Z \leftarrow \mathcal{A}(Z)$. Then, for any $n \ge 2$, any $m \ge 1$, and any
$\delta \in (0, 1)$, with probability at least $1 - \delta$ over draws of
$X$,

\[
R(h_Z) \le \hat{R}(h_Z) + 4p\beta_m
+ (4m\beta_m + M) \sqrt{\frac{p}{m} \ln\frac{1}{\delta}}.
\tag{5}
\]



Proof. We begin by showing that the defect satisfies
the conditions of McDiarmid's inequality (Theorem 2).
By Lemma 2, replacing any single example will change
the defect by at most $4\beta_m + M/m$. However, if we
replace any single instance $x_i$, this will affect up to $p_i$
examples, where $p_i$ denotes the frequency of $x_i$ in the
training set. We therefore have that replacing any $x_i$
has Lipschitz constant $\alpha_i = p_i(4\beta_m + M/m)$. Since
the instances are i.i.d., we can apply McDiarmid's
inequality and obtain

\[
\begin{aligned}
\Pr_X\big[ D(h_Z) - \mathbb{E}_X[D(h_Z)] \ge \epsilon \big]
&\le \exp\left( \frac{-2\epsilon^2}{\sum_{i=1}^n \alpha_i^2} \right) \\
&\le \exp\left( \frac{-2\epsilon^2 m^2}{(4m\beta_m + M)^2 \, p \sum_{i=1}^n p_i} \right) \\
&= \exp\left( \frac{-\epsilon^2 m}{(4m\beta_m + M)^2 \, p} \right).
\end{aligned}
\]

The second line uses $\sum_i p_i^2 \le p \sum_i p_i$, and the
last line is due to the handshaking lemma, which
states that the sum of the degrees (i.e., instance
frequencies) in a graph is equal to twice the number of
edges (i.e., examples). Setting the above equal to $\delta$
and solving for $\epsilon$, we have that

\[
R(h_Z) \le \hat{R}(h_Z) + \mathbb{E}_X[D(h_Z)]
+ (4m\beta_m + M) \sqrt{\frac{p}{m} \ln\frac{1}{\delta}}.
\]



What remains is to upper bound the expected defect.
Using linearity of expectation, we can state this as

\[
\mathbb{E}_X[D(h_Z)] = \frac{1}{m} \sum_{i=1}^m
\mathbb{E}_{X,z}\big[ \ell(h_Z, z) - \ell(h_Z, z_i) \big].
\]

Note that example $z_i = (x, x')$ depends on any examples
that share either $x$ or $x'$, i.e., its neighborhood
$N(z_i)$ in the dependency graph. However, by removing
$z_i$ and $N(z_i)$ from the training set, $z_i$ becomes
independent of any of the remaining examples. Accordingly,
let $Z' = Z \setminus \{z_i, N(z_i)\}$, and let $h_{Z'}$ be the resulting
hypothesis. Because we have removed $\deg(z_i) + 1$
examples from training, by uniform stability, we pay a
penalty of at most $\beta_m(\deg(z_i) + 1)$ loss per prediction.
Further, recall that $G_d$ is the line graph of the graph
$G$ defined in Section 3.1. It is therefore easy to show
that any node in $G_d$ (i.e., edge in $G$) has degree equal
to the sum of the degrees of its endpoints in $G$, minus
two; hence, $\deg(z_i) + 1 = \deg(x) + \deg(x') - 1 \le 2p - 1$.
We therefore have that

\[
\begin{aligned}
\mathbb{E}_X[D(h_Z)]
&\le \frac{1}{m} \sum_{i=1}^m \Big(
   \mathbb{E}_{X,z}\big[ \ell(h_{Z'}, z) - \ell(h_{Z'}, z_i) \big]
   + 2\beta_m(\deg(z_i) + 1) \Big) \\
&= \frac{2\beta_m}{m} \sum_{i=1}^m \big( 0 + \deg(z_i) + 1 \big) \\
&\le 2\beta_m(2p - 1) \le 4p\beta_m,
\end{aligned}
\]

where the second line follows from symmetry, since $z$
and $z_i$ are now i.i.d. variables. ■
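The proof leans on three elementary graph facts: the handshaking lemma ($\sum_i p_i = 2m$), the inequality $\sum_i p_i^2 \le p \sum_i p_i$, and the line-graph degree formula $\deg(z_i) = \deg(x) + \deg(x') - 2$. The Python sketch below checks all three on a small example; it assumes a simple graph (no repeated pairs), and the function name `check_degree_identities` and the example edge list are illustrative.

```python
from collections import Counter


def check_degree_identities(edges):
    """Return (handshake_ok, squares_ok, line_degree_ok) for a simple graph."""
    deg = Counter(v for e in edges for v in e)
    m = len(edges)
    p = max(deg.values())
    # Handshaking lemma: degrees sum to twice the number of edges.
    handshake_ok = sum(deg.values()) == 2 * m
    # Each p_i is at most p, so the sum of squares is at most p * sum(p_i).
    squares_ok = sum(d * d for d in deg.values()) <= p * sum(deg.values())
    # Degree of an edge in the line graph: neighbors share one endpoint.
    line_degree_ok = all(
        sum(1 for f in edges if f != e and set(e) & set(f))
        == deg[e[0]] + deg[e[1]] - 2
        for e in edges
    )
    return handshake_ok, squares_ok, line_degree_ok


edges = [(0, 1), (0, 2), (0, 3), (1, 2), (3, 4)]
result = check_degree_identities(edges)
```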

To obtain a non-vacuous bound, we require that
$\beta_m = O(1/m)$. This precludes stability w.r.t. the 0-1 loss $\ell_1$,
since any algorithm will have either $\beta = 0$ or $\beta = 1$,
regardless of $m$. We therefore use the ramp loss $\ell^\gamma$
again. To use the ramp loss, we introduce the notion of
classification stability. For the following, we consider
learning algorithms that output a real-valued hypothesis
$h : \mathcal{Z} \to \mathbb{R}$, where $\operatorname{sgn}(h(z))$ is the predicted label
of $z$.

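Definition 6 (Classification Stability). Let $Z$ and $Z'$
be as defined in Definition 5. A learning algorithm $\mathcal{A}$
has classification stability $\beta$ if, for any $Z \in \mathcal{Z}^m$, and
any $i \in [m]$, $\mathcal{A}$ returns real-valued hypotheses
$h_Z \leftarrow \mathcal{A}(Z)$ and $h_{Z'} \leftarrow \mathcal{A}(Z')$ such that

\[
\sup_{z \in \mathcal{Z}} \big| h_Z(z) - h_{Z'}(z) \big| \le \beta.
\]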



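Lemma 3 (Bousquet and Elisseeff, 2002). A learning
algorithm with classification stability $\beta$ has uniform
stability $\beta/\gamma$ w.r.t. the ramp loss $\ell^\gamma$.

Theorem 7. Let $X \in \mathcal{X}^n$, $Z \leftarrow S_m(X)$ and $p$ be as
defined in Theorem 6. Let $\mathcal{A}$ be a learning algorithm
with classification stability $\beta$, and let $h_Z \leftarrow \mathcal{A}(Z)$.
Then, for any $n \ge 2$, any $m \ge 1$, any $\gamma > 0$, and any
$\delta \in (0, 1)$, with probability at least $1 - \delta$ over draws of
$X$,

\[
R^1(h_Z) \le \hat{R}^\gamma(h_Z) + \frac{4p\beta_m}{\gamma}
+ \left( \frac{4m\beta_m}{\gamma} + 1 \right)
  \sqrt{\frac{p}{m} \ln\frac{1}{\delta}}.
\tag{6}
\]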



Proof The proof follows directly from Theorem 6 and 
Lemma 3, with the ramp loss obviously upper bounded 
by M = 1. ■ 

The application of this bound still depends on a stability
parameter, which is unique to the learning algorithm.
As a demonstration, we will focus on the class
of kernel methods described in Section 4.2; specifically,
SVM classification. Recall that, using stability analysis,
generalization is made possible by properties of the
learning algorithm, not the complexity of the hypothesis
class. For the class of kernel methods in particular,
this mechanism is regularization. We define a kernel-based
regularization algorithm as one of the form

\[
\mathcal{A}_\kappa(Z) = \operatorname*{argmin}_{h \in \mathcal{H}_\kappa}
\frac{1}{m} \sum_{z \in Z} \ell(h, z) + \lambda \|h\|_\kappa^2,
\]

where $\lambda > 0$ is a regularization parameter. The loss
function varies, depending on the application and
algorithm. In SVM classification, it is common to minimize
the hinge loss, defined as
$\ell^h(h, z) = \max\{0, 1 - r(z)h(z)\}$.
Denote the empirical hinge risk by $\hat{R}^h$.
Using a stability result from Bousquet and Elisseeff
[2002], we obtain a risk bound for SVM classification.

Lemma 4 (Bousquet and Elisseeff, 2002). An SVM
learning algorithm, with $\sup_z \|\phi(z)\| \le C$ and
regularization parameter $\lambda > 0$, has classification stability
$\beta_m \le C^2/(2\lambda m)$.

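Theorem 8. Let $X \in \mathcal{X}^n$, $Z \leftarrow S_m(X)$ and $p$ be as
defined in Theorem 6. Let $\mathcal{A}_\kappa$ be an SVM learning
algorithm, with $\sup_z \|\phi(z)\| \le C$ and $\lambda > 0$, and let
$h_{\kappa,Z} \leftarrow \mathcal{A}_\kappa(Z)$. Then, for any $n \ge 2$, any $m \ge 1$, and
any $\delta \in (0, 1)$, with probability at least $1 - \delta$ over draws
of $X$,

\[
R^1(h_{\kappa,Z}) \le \hat{R}^h(h_{\kappa,Z}) + \frac{2pC^2}{\lambda m}
+ \left( \frac{2C^2}{\lambda} + 1 \right)
  \sqrt{\frac{p}{m} \ln\frac{1}{\delta}}.
\tag{7}
\]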

Proof. Clearly, for $\gamma = 1$, $\hat{R}^h$ dominates $\hat{R}^\gamma$.
We therefore apply Theorem 7, with $\gamma = 1$ and
$\beta = C^2/(2\lambda m)$. ■
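The substitution in this proof can be traced numerically. The Python sketch below assumes the stability bound has the form read from Theorem 6, namely empirical risk plus $4p\beta_m$ plus $(4m\beta_m + M)\sqrt{\frac{p}{m}\ln\frac{1}{\delta}}$, and then plugs in $\gamma = 1$, $M = 1$ and the SVM stability $\beta_m = C^2/(2\lambda m)$ from Lemma 4. The function names and the example values are hypothetical.

```python
import math


def stability_risk_bound(empirical_risk, beta, M, m, p, delta):
    """Evaluate: empirical risk + 4*p*beta + (4*m*beta + M) * sqrt(p/m * ln(1/delta))."""
    if not (0.0 < delta < 1.0) or m < 1 or p < 1:
        raise ValueError("require 0 < delta < 1, m >= 1, p >= 1")
    expected_defect = 4.0 * p * beta
    deviation = (4.0 * m * beta + M) * math.sqrt(p / m * math.log(1.0 / delta))
    return empirical_risk + expected_defect + deviation


def svm_risk_bound(empirical_hinge_risk, C, lam, m, p, delta):
    """Specialize the stability bound to an SVM with gamma = 1 and M = 1."""
    beta = C ** 2 / (2.0 * lam * m)  # classification stability from Lemma 4
    return stability_risk_bound(empirical_hinge_risk, beta, 1.0, m, p, delta)


bound = svm_risk_bound(0.10, C=1.0, lam=1.0, m=1000, p=2, delta=0.05)
```

Since $4m\beta_m = 2C^2/\lambda$ does not depend on $m$, the deviation term shrinks at the rate $\sqrt{p/m}$, which is the ratio discussed next.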



4.4 Discussion of Bounds 

Our bounds are dominated by the term $p/m$, where $p$
is the maximum instance frequency (equivalently, the
maximum degree in $G$) and $m$ is the number of examples
(edges). We refer to the inverse of this ratio as the
effective training size. Letting $p_i$ denote the frequency
(i.e., degree) of instance $x_i$, we have that

\[
\frac{p}{m} = \frac{2p}{\sum_i p_i} = \frac{2}{\sum_i p_i / p}.
\]

It is straightforward to show that this quantity is minimized
when $G$ is regular, that is, when $p_i$ is uniform.
In fact, for any regular $G$, we have that $p/m = 2/n$.

Assuming one cannot acquire new examples, one can discard examples to
make a regular graph, which gives the optimal ratio. This is equivalent
to finding a $k$-regular spanning subgraph (i.e., $k$-factor), for some
$k \ge 1$. This is not always possible without reducing the number of
instances (i.e., vertices), as some graphs do not admit such a
$k$-factor. For example, if the highest degree vertex is adjacent to
multiple degree-1 vertices (as in a star graph), then certain vertices
will be "isolated" when edges are removed. In fact, the effective number
of instances $n'$ is the order of the largest $k$-regular induced
subgraph, for some $k \ge 1$; we therefore have that the effective
training size is upper bounded by $n'/2$. That said, identifying $n'$
for an arbitrary graph is NP-hard.
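
To make the quantity $n'$ concrete on tiny graphs, here is a brute-force Python sketch (an illustration under stated assumptions, not an algorithm from the paper). It enumerates vertex subsets from largest to smallest and returns the first one whose induced subgraph is $k$-regular for some $k \ge 1$. The function name is hypothetical, vertices are assumed to be labeled 0 to n-1, and the search is exponential in n, in line with the hardness remark above.

```python
from itertools import combinations


def largest_regular_induced_subgraph(n, edges):
    """Brute-force search for the largest k-regular induced subgraph, k >= 1.

    Returns (order, k, vertices), or (0, 0, ()) if no such subgraph exists.
    Exponential in n, so only suitable for tiny illustrative graphs.
    """
    edge_set = {frozenset(e) for e in edges}
    for size in range(n, 1, -1):  # try the largest subsets first
        for subset in combinations(range(n), size):
            # Degree of each vertex inside the induced subgraph.
            degrees = [
                sum(frozenset((u, v)) in edge_set for v in subset if v != u)
                for u in subset
            ]
            k = degrees[0]
            if k >= 1 and all(d == k for d in degrees):
                return size, k, subset
    return 0, 0, ()


star = [(0, i) for i in range(1, 5)]                  # center 0, four leaves
triangle_with_tail = [(0, 1), (1, 2), (0, 2), (0, 3)]

print(largest_regular_induced_subgraph(5, star))
print(largest_regular_induced_subgraph(4, triangle_with_tail))
```

For the star, only a single edge survives, so the effective number of instances is 2 regardless of how many leaves the star has.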

It is tempting to think that, by discarding examples to induce a
1-regular subgraph, one can reduce our learning setup to the i.i.d.
scenario and consequently apply classical analysis. However, there may
be a regular (sub)graph of higher degree that yields a better effective
training size. For instance, consider a graph consisting of $t$ disjoint
triangles (i.e., $n = 3t$); this graph is already 2-regular, so without
pruning edges it has an effective training size $n/2 = 3t/2$; if pruned
to a 1-regular subgraph, the effective training size would be just $t$.
Moreover, while the above shows that discarding examples might minimize
our bounds, without intimate knowledge of the learning algorithm,
hypothesis class or distribution, our bounds may be overly pessimistic
in certain scenarios. Stronger assumptions may lead to tighter risk
bounds, to support the intuition that more training data, albeit
dependent data, will always improve generalization.
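
The arithmetic behind the triangle example can be written out as follows. This is a worked restatement rather than a new result; the primed symbols $m'$ and $\rho'$ for the pruned graph are notation introduced here for illustration. Any two edges of a triangle share a vertex, so a 1-regular subgraph keeps at most one edge per triangle.

```latex
\begin{align*}
\text{unpruned:}\quad & n = 3t,\; m = 3t,\; \rho = 2
  && \Longrightarrow\quad \frac{m}{\rho} = \frac{3t}{2} = \frac{n}{2}, \\
\text{pruned to 1-regular:}\quad & m' = t,\; \rho' = 1
  && \Longrightarrow\quad \frac{m'}{\rho'} = t .
\end{align*}
```

Pruning therefore shrinks the effective training size by a factor of 3/2 in this example.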

5 On Subsampling and the Rate of Uniform Convergence

We have shown that the empirical risk converges to the true risk at a
rate of $O(\sqrt{\rho/m})$, depending primarily on the size of the
training set and the maximum frequency of any instance. While $m$ may
be determined by one's annotation or computation budget, $\rho$ depends
on the subsampling used to select the training set. In this section, we
examine the relationship between the subsampling process used by the
labeler and the rate of uniform convergence.

Recall that the labeler cannot subsample based on the values of the
input data, but it can subsample patterns that help or hurt
generalization. If the labeler is working against the learner, it can
select pairs such that one instance appears in all training examples,
meaning $\rho = m$. In this scenario, our bounds indicate that a
hypothesis learned from this training set might not generalize. In
contrast, if the labeler is working with the learner, it can subsample
pairs so as to induce a regular label graph, as discussed in the
previous section. This would yield an optimal convergence rate of
$O(1/\sqrt{n})$, comparable to classical results.

We may also consider a setting in which the subsampling is a random
process. For example, if the labeler selects pairs uniformly at random
without replacement, then, as previously noted, this process can be
modeled by the Erdős-Rényi random graph model. We then have that the
rate of convergence is a function of the maximum degree of $G(n, m)$,
since this is equivalent to the maximum instance frequency.

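Lemma 5. Let $G = (V, E)$ be a graph in $\mathcal{G}(n, m)$, for a
given $n$ and $m$. Then, with probability at least $1 - \delta$, its
maximum degree $\Delta(G)$ is upper bounded as

$$\Delta(G) \le \frac{2m}{n}\left(1 + \sqrt{\frac{3n}{2m}\ln\frac{n}{\delta}}\right).$$ (8)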

We provide the proof in Appendix A. 
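
The random subsampling process can be simulated directly. The Python sketch below is an illustration, not code from the paper: it draws m distinct unordered pairs uniformly without replacement and reports the maximum degree of the resulting graph. The helper `chernoff_degree_bound` is a hypothetical name for the expression obtained by solving $n \exp(-\mu\epsilon^2/3) = \delta$ with $\mu = 2m/n$, following the argument in Appendix A. Note that this form of the multiplicative Chernoff bound is usually stated for $0 < \epsilon \le 1$, so the toy parameters are chosen to keep $\epsilon$ in that range.

```python
import itertools
import math
import random
from collections import Counter


def sample_label_graph(n, m, rng):
    """Sample m distinct unordered pairs uniformly without replacement."""
    all_pairs = list(itertools.combinations(range(n), 2))
    return rng.sample(all_pairs, m)


def max_degree(edges):
    """Maximum instance frequency, i.e., maximum degree of the label graph."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return max(degree.values())


def chernoff_degree_bound(n, m, delta):
    """Value t = mu * (1 + eps), where n * exp(-mu * eps^2 / 3) = delta.

    Assumes mu = 2m / n; the Chernoff form used is standard for eps <= 1.
    """
    mu = 2.0 * m / n
    eps = math.sqrt(3.0 * math.log(n / delta) / mu)
    return mu * (1.0 + eps)


rng = random.Random(0)        # fixed seed for reproducibility
n, m, delta = 200, 3000, 0.05
edges = sample_label_graph(n, m, rng)
print("expected degree:", 2 * m / n)
print("observed maximum degree:", max_degree(edges))
print("Chernoff-style bound:", chernoff_degree_bound(n, m, delta))
```

Enumerating all pairs is quadratic in n, which is acceptable for a toy experiment but would need a smarter sampler for large n.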

Using this as an upper bound for $\rho$ (since $\rho = \Delta(G)$), we
obtain the following corollary of Theorem 5. A similar result can be
shown for Theorem 8.

Theorem 9. Let $X \in \mathcal{X}^n$, $Z \leftarrow S_m(X)$,
$h_{\mathcal{A},Z}$, $C$ and $\lambda$ be as previously defined. If $S$
samples $m$ examples uniformly at random from $X^2$, then, for any
$n \ge 2$, any $m \ge n/2$, any $\gamma > 0$, and any
$\delta \in (0, 1)$, with probability at least $1 - \delta$ over draws
of $X$, and $\eta =$

R\K,z) < IV{K,z) + ^ + \/^ln^. (9) 

Proof. Equation 9 follows from Theorem 5 and Lemma 5, by allowing
failure probability $\delta/2$ to Equation 4 and $\delta/2$ to
Equation 8. We then simplify the bound by leveraging the fact that
$1/m \le 2/n$. ■

We point out that these bounds have a natural interpretation. Whereas
Agarwal and Niyogi [2009] invoke parameterized families of edge
distributions, we consider a simple, intuitive learning setup in which
the only parameter is the size $m$ of the training set. If
$m \ge n \ln n$, then $\eta = O(1)$, and we obtain a uniform convergence
rate of $O(1/\sqrt{n})$. For $m \ge n/2$, we have that
$\eta = O(\sqrt{\ln n})$, and the rate is still $O(1/\sqrt{n})$.

A Proof of Lemma 5

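For each pair of vertices $\{v_i, v_j\}$, define a random variable
$E_{ij} = \mathbb{1}[\{v_i, v_j\} \in E]$. Note that
$\deg(v_i) = \sum_{j \ne i} E_{ij}$ and, via linearity of expectation,

$$\mathbb{E}[\deg(v_i)] = \sum_{j \ne i} \mathbb{E}[E_{ij}] = (n - 1) \cdot \frac{m}{\binom{n}{2}} = \frac{2m}{n}.$$

Since the expected degree is uniform, let $\mu = \mathbb{E}[\deg(v_i)]$.
Using the union bound, for any $t > 0$, we have that

$$\Pr[\Delta(G) \ge t] \le \sum_{i=1}^{n} \Pr[\deg(v_i) \ge t] = \sum_{i=1}^{n} \Pr\Big[\sum_{j \ne i} E_{ij} \ge t\Big].$$

Though the variables $\{E_{ij}\}_{j \ne i}$ are dependent, it is
straightforward to show that they are negatively correlated. Therefore,
we can apply multiplicative Chernoff bounds, with
$t = \mu(1 + \epsilon)$, to obtain

$$\Pr[\Delta(G) \ge t] \le \sum_{i=1}^{n} \exp(-\mu\epsilon^2/3) = n \exp\left(-\frac{2m\epsilon^2}{3n}\right).$$

Setting this equal to $\delta$, then solving for $\epsilon$ and $t$
completes the proof.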
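
For completeness, the final step can be carried out explicitly. The derivation below is a sketch that assumes the tail bound reads $n \exp(-\mu\epsilon^2/3)$ with $\mu = 2m/n$, as in the argument above; the second form of $t$ is only an algebraic rearrangement.

```latex
\begin{align*}
n \exp\left(-\frac{2m\epsilon^2}{3n}\right) = \delta
  \quad &\Longrightarrow\quad
  \epsilon = \sqrt{\frac{3n}{2m}\ln\frac{n}{\delta}}, \\
t = \mu(1 + \epsilon)
  &= \frac{2m}{n}\left(1 + \sqrt{\frac{3n}{2m}\ln\frac{n}{\delta}}\right)
   = \frac{2m}{n} + \sqrt{\frac{6m}{n}\ln\frac{n}{\delta}} .
\end{align*}
```

Hence, with probability at least $1 - \delta$, the maximum degree is below this value of $t$.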



